Python: Create a "Table Of Contents" with python-docx/lxml

I'm trying to automate the creation of .docx files (WordML) with the help of python-docx (https://github.com/mikemaccana/python-docx). My current script creates the ToC manually with the following loop:

for chapter in myChapters:
    body.append(paragraph(chapter.text, style='ListNumber'))

Does anyone know of a way to use Word's built-in ToC function, which builds the index automatically and also creates links from each entry to its chapter?

Thanks a lot!

Antares answered 3/9, 2013 at 15:17 Comment(0)
A
22

The key challenge is that a rendered ToC depends on pagination to know what page number to put for each heading. Pagination is a function provided by the layout engine, a very complex piece of software built into the Word client. Writing a page layout engine in Python is probably not a good idea, and it is definitely not a project I'm planning to undertake anytime soon :)

The ToC is composed of two parts:

  1. the element that specifies the ToC placement and things like which heading levels to include.
  2. the actual visible ToC content, headings and page numbers with dotted lines connecting them.

Creating the element is pretty straightforward and relatively low-effort. Creating the actual visible content, at least if you want the page numbers included, requires the Word layout engine.
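
To make the two parts concrete, here is a minimal sketch of the markup involved, built with the standard library only rather than python-docx. The helper name `build_toc_paragraph` and the placeholder text are illustrative assumptions; the element names (`w:fldChar`, `w:instrText`, `w:t`) are the WordprocessingML ones used in the other answers on this page.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return the namespace-qualified name for a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def build_toc_paragraph(levels="1-3", placeholder="Update field to build the ToC."):
    """Build a w:p element holding a TOC field (hypothetical helper)."""
    p = ET.Element(w("p"))
    run = ET.SubElement(p, w("r"))

    # Part 1: the field definition (placement and heading levels).
    ET.SubElement(run, w("fldChar"), {w("fldCharType"): "begin"})
    instr = ET.SubElement(run, w("instrText"), {f"{{{XML_NS}}}space": "preserve"})
    instr.text = rf'TOC \o "{levels}" \h \z \u'
    ET.SubElement(run, w("fldChar"), {w("fldCharType"): "separate"})

    # Part 2: the visible result; Word replaces it when the field is updated.
    text = ET.SubElement(run, w("t"))
    text.text = placeholder
    ET.SubElement(run, w("fldChar"), {w("fldCharType"): "end"})
    return p


toc_xml = ET.tostring(build_toc_paragraph(), encoding="unicode")
print(toc_xml)
```

Everything between `begin` and `separate` is the instruction; everything between `separate` and `end` is the cached result that only a layout engine can fill in properly.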

These are the options:

  1. Just add the tag and a few other bits to signal Word the ToC needs to be updated. When the document is first opened, a dialog box appears saying links need to be refreshed. The user clicks Yes and Bob's your uncle. If the user clicks No, the ToC title appears with no content below it and the ToC can be updated manually.

  2. Add the tag and then engage a Word client, by means of C# or Visual Basic against the Word Automation library, to open and save the file; all the fields (including the ToC field) get updated.

  3. Do the same thing server-side if you have a SharePoint instance or whatever that can do it with Word Automation Services.

  4. Create an AutoOpen macro in the document that automatically runs the field update when the document is opened. Probably won't pass a lot of virus checkers and won't work on locked-down Windows builds common in a corporate setting.

Here's a very nice set of screencasts by Eric White that explain all the hairy details

Audi answered 3/9, 2013 at 23:27 Comment(1)
Can you guide me on how to go with the first option, using tag and signal, please? – Corporate
K
16

Sorry for adding an answer to an old post, but I think it may be helpful. This is not my solution; I found it here: https://github.com/python-openxml/python-docx/issues/36 Thanks to https://github.com/mustash and https://github.com/scanny

    from docx import Document
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    document = Document()  # the original snippet used self.document inside a class
    paragraph = document.add_paragraph()
    run = paragraph.add_run()

    fldChar = OxmlElement('w:fldChar')  # field start marker
    fldChar.set(qn('w:fldCharType'), 'begin')
    instrText = OxmlElement('w:instrText')
    instrText.set(qn('xml:space'), 'preserve')  # keep spaces in the instruction
    instrText.text = 'TOC \\o "1-3" \\h \\z \\u'   # change 1-3 depending on heading levels you need

    fldChar2 = OxmlElement('w:fldChar')  # separates the instruction from the result
    fldChar2.set(qn('w:fldCharType'), 'separate')
    placeholder = OxmlElement('w:t')  # text shown until the field is updated
    placeholder.text = "Right-click to update field."

    fldChar4 = OxmlElement('w:fldChar')  # field end marker
    fldChar4.set(qn('w:fldCharType'), 'end')

    # w:t is a sibling of the w:fldChar elements inside the run, not a child of one
    r_element = run._r
    r_element.append(fldChar)
    r_element.append(instrText)
    r_element.append(fldChar2)
    r_element.append(placeholder)
    r_element.append(fldChar4)
Karafuto answered 5/2, 2018 at 12:9 Comment(6)
instrText.text = 'TOC \o "1-3" \h \z \u' # change 1-3 depending on heading levels you need
^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 19-20: truncated \uXXXX escape – Waligore
Aha! See this page, where @Sup3rGeo says "I also had to escape \\o \\h \\z \\u for it to work without errors", after which it works for me. I suggest that you update your otherwise excellent answer. Now, if only there were some way to programmatically update the table of contents ;-) – Waligore
@Waligore Thank you for the suggestion. Will update the comment. – Karafuto
Great (+1) ! That will certainly help others in future (it certainly helped me, once I figured out the double escaping). Now, if only there were some way to programmatically update the table of contents ;-) – Waligore
@MawgsaysreinstateMonica or anyone else have you found a way on ubuntu for updating the table of contents? – Egocentric
Is there a way to change the resulting TOC's style? For me it always resets to Arial 9. – Jetblack
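
The SyntaxError in the comments comes from Python itself: in a normal string literal `\u` starts a Unicode escape that expects four hex digits. Doubling the backslashes or using a raw string both avoid it. The sketch below shows that the two spellings are the same string; `toc_instruction` is a hypothetical helper, not part of python-docx.

```python
def toc_instruction(first_level=1, last_level=3):
    """Return a TOC field instruction for the given heading levels."""
    if not 1 <= first_level <= last_level <= 9:
        raise ValueError("heading levels must satisfy 1 <= first <= last <= 9")
    # Raw f-string: backslashes stay literal, braces are still interpolated.
    return rf'TOC \o "{first_level}-{last_level}" \h \z \u'


escaped = 'TOC \\o "1-3" \\h \\z \\u'  # doubled backslashes
raw = r'TOC \o "1-3" \h \z \u'         # raw string literal

# Both spellings produce exactly the same text for instrText.text.
assert escaped == raw == toc_instruction(1, 3)
```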
Q
12

Please see explanations in the code comments.

# First set directory where you want to save the file

import os
os.chdir("D:/")

# Now import required packages

from docx import Document
from docx.oxml.ns import qn
from docx.oxml import OxmlElement

# Initialising document to make word file using python

document = Document()

# Code for making Table of Contents

paragraph = document.add_paragraph()
run = paragraph.add_run()

fldChar = OxmlElement('w:fldChar')  # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin')  # sets attribute on element

instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve')  # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u'   # change 1-3 depending on heading levels you need

fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')

placeholder = OxmlElement('w:t')  # text shown until the field is updated
placeholder.text = "Right-click to update field."

fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')

# All five elements are siblings inside the run; w:t is not nested in w:fldChar
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(placeholder)
r_element.append(fldChar4)

# Giving headings that need to be included in Table of contents

document.add_heading("Network Connectivity")
document.add_heading("Weather Stations")

# Saving the word file by giving name to the file

name = "mdh2"
document.save(name+".docx")

# Now open the Word file that was created

# Select the "Right-click to update field." text
# Right-click it and choose the "Update Field" option,
# then choose to update the entire table

# You will now find an automatic Table of Contents
Quartan answered 4/12, 2019 at 7:8 Comment(6)
Thanks for the code snippet, But why do we have to update the Table of content manually. Is there a way to automate that in python script as well? – Missie
I think I found out how to auto-update. Add these lines of code for fldChar3: fldChar3 = OxmlElement('w:updateFields'); fldChar3.set(qn('w:val'), 'true') – Missie
When I open word document, it says that the document is linked to another document. What's the reason for that? – Launcher
That's excellent! Nice-to-have: how would you enumerate the headings? – Tamica
@Missie Can you please add your approach with autoupdate with full code as a separate answer? I don't get it. ;) – Alexandros
See also github.com/elapouya/python-docx-template/issues/… – Alexandros
R
5

@Mawg // Updating ToC

I had the same problem updating the ToC and searched for a solution. Not my code, but it works:

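import win32com.client  # from pywin32; requires Windows with Word installed

word = win32com.client.DispatchEx("Word.Application")
doc = word.Documents.Open(input_file_name)  # input_file_name: path to the .docx
doc.TablesOfContents(1).Update()  # refresh the first ToC in the document
doc.Close(SaveChanges=True)
word.Quit()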
Respire answered 14/3, 2019 at 11:34 Comment(1)
This needs a Windows environment. #62159202 Will need to explore how to do this step in Ubuntu/Mac environments. – Missie
B
0
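from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.shared import Inches, Pt


def fields(self, new_doc):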
    para = new_doc.add_paragraph("Table of Content")
    para.alignment = WD_ALIGN_PARAGRAPH.CENTER 
    for run in para.runs:
        run.font.name = 'Arial'
        run.font.size = Pt(14)
        run.bold = True
        run.underline = True
    paragraph = new_doc.add_paragraph()
    paragraph.paragraph_format.space_before = Inches(0)
    paragraph.paragraph_format.space_after = Inches(0)
    run = paragraph.add_run()

    fldChar = OxmlElement('w:fldChar')  # creates a new element
    fldChar.set(qn('w:fldCharType'), 'begin')  # sets attribute on element

    instrText = OxmlElement('w:instrText')
    instrText.set(qn('xml:space'), 'preserve')  # sets attribute on element
    instrText.text = 'TOC \\o "1-3" \\h \\z \\u'   # change 1-3 depending on heading levels you need

    fldChar2 = OxmlElement('w:fldChar')
    fldChar2.set(qn('w:fldCharType'), 'separate')

    placeholder = OxmlElement('w:t')  # text shown until the field is updated
    placeholder.text = "Right-click to update field."

    fldChar4 = OxmlElement('w:fldChar')
    fldChar4.set(qn('w:fldCharType'), 'end')

    r_element = run._r
    r_element.append(fldChar)
    r_element.append(instrText)
    r_element.append(fldChar2)
    r_element.append(placeholder)
    r_element.append(fldChar4)

    # w:updateFields is a document setting (word/settings.xml), not part of the field
    update_fields = OxmlElement('w:updateFields')
    update_fields.set(qn('w:val'), 'true')
    new_doc.settings.element.append(update_fields)

This is the code that generates the table of contents automatically.

However, I still have a question: how can I create a List of Figures page that contains all the figure captions with their page numbers as a field?
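
A possible direction for the List of Figures, offered as a sketch rather than a tested recipe: Word's TOC field has a `\c` switch that collects captions numbered by a `SEQ` field with the given label. This assumes the captions in the document actually contain `SEQ Figure` fields; captions typed as plain text would not be picked up. The two helper names below are hypothetical, and the strings they return would go into `instrText.text` in the same begin/separate/end structure used above.

```python
def figure_list_instruction(label="Figure"):
    """Field instruction listing every caption numbered by SEQ <label>."""
    return rf'TOC \h \z \c "{label}"'


def caption_number_instruction(label="Figure"):
    """Field instruction that numbers a single caption."""
    return rf"SEQ {label} \* ARABIC"


print(figure_list_instruction())      # TOC \h \z \c "Figure"
print(caption_number_instruction())   # SEQ Figure \* ARABIC
```

As with the ToC itself, the page numbers only appear once Word (or another layout engine) updates the field.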

Breaux answered 30/10, 2023 at 10:35 Comment(0)
